fairness analysis
Fairness-Driven LLM-based Causal Discovery with Active Learning and Dynamic Scoring
Causal discovery (CD) plays a pivotal role in numerous scientific fields by clarifying the causal relationships that underlie phenomena observed in diverse disciplines. Despite significant advancements in CD algorithms that enhance bias and fairness analyses in machine learning, their application faces challenges due to the high computational demands and complexities of large-scale data. This paper introduces a framework that leverages Large Language Models (LLMs) for CD, utilizing a metadata-based approach akin to the reasoning processes of human experts. By shifting from pairwise queries to a more scalable breadth-first search (BFS) strategy, the number of required queries is reduced from quadratic to linear in the number of variables, thereby addressing scalability concerns inherent in previous approaches. The method combines Active Learning (AL) with a Dynamic Scoring Mechanism that prioritizes queries based on their potential information gain, mixing mutual information, partial correlation, and LLM confidence scores to refine the causal graph more efficiently and accurately. This study provides a more scalable and efficient solution for leveraging LLMs in fairness-driven CD, highlighting the effect of different parameters on performance. We perform fairness analyses on the inferred causal graphs, identifying direct and indirect effects of sensitive attributes on outcomes. A comparison of these analyses against those from graphs produced by baseline methods highlights the importance of accurate causal graph construction in understanding bias and ensuring fairness in machine learning systems.
- Research Report > Experimental Study (0.66)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
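The paper does not publish code, but the dynamic scoring idea described in the abstract lends itself to a compact illustration: a query priority that mixes mutual information, partial correlation, and an LLM confidence score. The weights, helper names, and weighting scheme below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def partial_correlation(x, y, z):
    """Partial correlation of x and y given the conditioning variables in z (2-D array)."""
    # Residualize x and y on z via least squares, then correlate the residuals.
    z1 = np.column_stack([np.ones(len(x)), z])
    rx = x - z1 @ np.linalg.lstsq(z1, x, rcond=None)[0]
    ry = y - z1 @ np.linalg.lstsq(z1, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def query_priority(x, y, z, llm_confidence, w_mi=0.4, w_pc=0.3, w_llm=0.3):
    """Priority of querying the LLM about a candidate edge x -> y.

    llm_confidence is the model's self-reported confidence (0-1) that the edge
    exists; the weights are illustrative, not values from the paper.
    """
    mi = mutual_info_regression(x.reshape(-1, 1), y)[0]
    pc = abs(partial_correlation(x, y, z))
    return w_mi * mi + w_pc * pc + w_llm * llm_confidence
```

Candidate edges on the BFS frontier would then be queried in descending order of this score, so the most informative questions are spent first.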
Towards Responsible AI in Education: Hybrid Recommendation System for K-12 Students Case Study
Drushchak, Nazarii, Tyshchenko, Vladyslava, Polyakovska, Nataliya
The growth of Educational Technology (EdTech) has enabled highly personalized learning experiences through Artificial Intelligence (AI)-based recommendation systems tailored to each student's needs. However, these systems can unintentionally introduce biases, potentially limiting fair access to learning resources. This study presents a recommendation system for K-12 students, combining graph-based modeling and matrix factorization to provide personalized suggestions for extracurricular activities, learning resources, and volunteering opportunities. To address fairness concerns, the system includes a framework to detect and reduce biases by analyzing feedback across protected student groups. This work highlights the need for continuous monitoring in educational recommendation systems to support equitable, transparent, and effective learning opportunities for all students.
INTRODUCTION The rapid advancement of Educational Technology (EdTech) has significantly reshaped traditional learning environments, enabling the delivery of personalized educational experiences tailored to individual students' needs. According to the U.S. Department of Education Office of Educational Technology, leveraging AI-based modern educational technologies has been pivotal in providing personalized pathways for learning, supporting adaptive and individualized instruction, and enhancing student engagement through innovative digital solutions. This trend toward personalization in education underscores the importance of leveraging advanced recommendation systems to support student exploration and growth.
- North America > United States > Texas > Travis County > Austin (0.04)
- Europe > Ukraine > Lviv Oblast > Lviv (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
- Instructional Material (1.00)
- Overview (0.68)
- Research Report (0.65)
- Education > Educational Technology (1.00)
- Education > Educational Setting > K-12 Education (0.85)
- Government > Regional Government > North America Government > United States Government (0.74)
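For orientation, a bare-bones matrix-factorization recommender with a simple feedback-gap check across protected groups might look like the sketch below. It is a generic illustration under assumed inputs (an interaction matrix and group labels), not the system described in the paper.

```python
import numpy as np

def factorize(ratings, k=8, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Plain SGD matrix factorization on a (students x items) interaction matrix.

    Zero entries are treated as unobserved. Generic illustration only.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    rows, cols = ratings.nonzero()
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = ratings[u, i] - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

def feedback_gap(predicted_scores, group_labels):
    """Gap between the best- and worst-served protected groups in mean predicted score."""
    means = [predicted_scores[group_labels == g].mean() for g in np.unique(group_labels)]
    return max(means) - min(means)
```

Monitoring a gap of this kind over time is one simple way to operationalize the continuous fairness monitoring the abstract calls for.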
Testing for Causal Fairness
Fu, Jiarun, Ding, LiZhong, Li, Pengqi, Wei, Qiuning, Cheng, Yurong, Chen, Xu
Causality is widely used in fairness analysis to prevent discrimination on sensitive attributes, such as gender in career recruitment and race in crime prediction. However, the current data-based Potential Outcomes Framework (POF) often leads to untrustworthy fairness analysis results when handling high-dimensional data. To address this, we introduce a distribution-based POF that transforms fairness analysis into Distributional Closeness Testing (DCT) by intervening on sensitive attributes. We define counterfactual closeness fairness as the null hypothesis of DCT, where a sensitive attribute is considered fair if its factual and counterfactual potential outcome distributions are sufficiently close. We introduce the Norm-Adaptive Maximum Mean Discrepancy Treatment Effect (N-TE) as a statistic for measuring distributional closeness and apply DCT using the empirical estimator of N-TE, referred to as Counterfactual Fairness-CLOseness Testing (CF-CLOT). To ensure the trustworthiness of testing results, we establish the testing consistency of N-TE through rigorous theoretical analysis. CF-CLOT demonstrates sensitivity in fairness analysis through the flexibility of the closeness parameter $\epsilon$. Unfair sensitive attributes have been successfully tested by CF-CLOT in extensive experiments across various real-world scenarios, validating the consistency of the test.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language (0.68)
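The N-TE statistic itself is not spelled out in the abstract. As orientation, a plain (non-norm-adaptive) MMD between factual and counterfactual outcome samples, with an $\epsilon$-closeness decision, can be sketched as follows; the RBF kernel choice and all names are assumptions, so this is a stand-in for the idea rather than CF-CLOT.

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimator of the squared MMD between two outcome samples (rows = units)."""
    return (rbf_gram(x, x, sigma).mean()
            + rbf_gram(y, y, sigma).mean()
            - 2.0 * rbf_gram(x, y, sigma).mean())

def closeness_test(factual, counterfactual, epsilon, sigma=1.0):
    """Declare the sensitive attribute 'fair' when the estimated distance stays within epsilon."""
    stat = mmd2(factual, counterfactual, sigma)
    return stat, stat <= epsilon
```

The closeness parameter epsilon plays the same role as in the paper: smaller values make the test more sensitive to distributional differences between factual and counterfactual outcomes.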
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification
Sun, Xiangyu, Zou, Xiaoguang, Wu, Yuanquan, Wang, Guotai, Zhang, Shaoting
X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness - particularly regarding demographic attributes - remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundational models.
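A minimal zero-shot CLIP evaluation with accuracy disaggregated by demographic group, in the spirit of this study, could look like the sketch below. It uses the generic OpenAI CLIP checkpoint rather than the medical CLIP variants the paper evaluates, and the prompts and group handling are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint for illustration; the paper also covers medical CLIP variants.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a chest X-ray showing pneumonia", "a chest X-ray with no finding"]  # illustrative

def zero_shot_scores(images):
    """Zero-shot class probabilities for a batch of PIL X-ray images."""
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (batch, num_prompts)
    return logits.softmax(dim=-1)

def accuracy_by_group(preds, labels, groups):
    """Disaggregate accuracy over demographic groups to expose fairness gaps."""
    return {
        g: sum(p == l for p, l, gg in zip(preds, labels, groups) if gg == g)
           / sum(1 for gg in groups if gg == g)
        for g in set(groups)
    }
```

Fine-tuning variants (linear probing, MLP heads, LoRA, full fine-tuning) would change how the image features are adapted, but the subgroup-disaggregated evaluation step stays the same.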
INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models
Jin, Di, Liu, Xing, Liu, Yu, Yap, Jia Qing, Wong, Andrea, Crespo, Adriana, Lin, Qi, Yin, Zhiyuan, Yan, Qiang, Ye, Ryan
The rapid development of large language models (LLMs) and large vision models (LVMs) has propelled the evolution of multi-modal AI systems, which have demonstrated remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models have highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on the comprehensiveness of various aspects of the models, but they exhibit critical limitations, including insufficient attention to content generation alignment and social bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation of widely used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that existing models in the study generally do not meet the empirical fairness criteria, and representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles.
- North America > United States > Alaska (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Government > Regional Government > North America Government > United States Government (0.46)
- Health & Medicine (0.46)
- Social Sector (0.34)
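As an illustration of a representation-bias measure of the kind the abstract describes, one generic option is the total variation distance between the demographic mix of generated images and a reference distribution. This is not INFELM's exact definition; the function and inputs below are assumptions.

```python
import numpy as np

def representation_bias(group_counts, reference=None):
    """Total variation distance between the demographic mix of generated images
    and a reference distribution (uniform by default).

    group_counts: dict mapping group label -> number of generated images the
    skintone (or other demographic) classifier assigned to that group.
    Generic measure for illustration, not INFELM's metric.
    """
    groups = sorted(group_counts)
    p = np.array([group_counts[g] for g in groups], dtype=float)
    p /= p.sum()
    q = (np.full(len(p), 1.0 / len(p)) if reference is None
         else np.array([reference[g] for g in groups], dtype=float))
    return 0.5 * float(np.abs(p - q).sum())
```

A value of 0 means the generated images match the reference demographic mix exactly; 1 means the distributions are disjoint.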
Revealed: bias found in AI system used to detect UK benefits fraud
An artificial intelligence system used by the UK government to detect welfare fraud is showing bias according to people's age, disability, marital status and nationality, the Guardian can reveal. An internal assessment of a machine-learning programme used to vet thousands of claims for universal credit payments across England found it incorrectly selected people from some groups more than others when recommending whom to investigate for possible fraud. The admission was made in documents released under the Freedom of Information Act by the Department for Work and Pensions (DWP). The "statistically significant outcome disparity" emerged in a "fairness analysis" of the automated system for universal credit advances carried out in February this year. The emergence of the bias comes after the DWP this summer claimed the AI system "does not present any immediate concerns of discrimination, unfair treatment or detrimental impact on customers". This assurance came in part because the final decision on whether a person gets a welfare payment is still made by a human, and officials believe the continued use of the system – which is attempting to help cut an estimated £8bn a year lost in fraud and error – is "reasonable and proportionate".
Monitoring fairness in machine learning models that predict patient mortality in the ICU
van Schaik, Tempest A., Liu, Xinggang, Atallah, Louis, Badawi, Omar
Benchmarking can include comparing an ICU's actual performance with predicted performance. The increased interoperability of medical devices, electronic health records (EHRs) and information systems has improved the acquisition and presentation of data to healthcare professionals. This data has enabled the training of predictive models. However, this plethora of data sources has also introduced new risks that societal bias will lead to machine learning systems with fairness issues for patient groups. In addition, when variations in data documentation are non-random, significant bias can be introduced, improving or worsening measured performance for an institution relative to peers. This work focuses on ICU mortality benchmarking. In particular, we analyze the fairness of a model based on Generalised Additive Models (GAM) [3] that predicts mortality in the ICU. This model is used to compare actual versus predicted outcomes to assess ICU performance.
- Asia > Singapore (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Technology > Medical Record (0.54)
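One common way to compare actual versus predicted outcomes per patient group in ICU benchmarking is a subgroup standardized mortality ratio (observed deaths divided by the sum of predicted mortality probabilities). The sketch below assumes hypothetical column names and is not the paper's GAM pipeline.

```python
import pandas as pd

def smr_by_group(df, pred_col="predicted_mortality", outcome_col="died", group_col="group"):
    """Standardized mortality ratio (observed / expected deaths) per patient group.

    Column names are assumptions for illustration. An SMR far from 1 in only some
    groups suggests the benchmarking model is miscalibrated for those groups.
    """
    grouped = df.groupby(group_col)
    return grouped[outcome_col].sum() / grouped[pred_col].sum()
```

Tracking these ratios across protected groups is one way to turn the paper's fairness monitoring into a routine dashboard check.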
Local Causal Discovery for Structural Evidence of Direct Discrimination
Maasch, Jacqueline, Gan, Kyra, Chen, Violet, Orfanoudaki, Agni, Akpinar, Nil-Jana, Wang, Fei
Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient causal discovery methods for fairness applications. To this end, we introduce local discovery for direct discrimination (LD3): a polynomial-time algorithm that recovers structural evidence of direct discrimination. LD3 performs a linear number of conditional independence tests with respect to variable set size. Moreover, we propose a graphical criterion for identifying the weighted controlled direct effect (CDE), a qualitative measure of direct discrimination. We prove that this criterion is satisfied by the knowledge returned by LD3, increasing the accessibility of the weighted CDE as a causal fairness measure. Taking liver transplant allocation as a case study, we highlight the potential impact of LD3 for modeling fairness in complex decision systems. Results on real-world data demonstrate more plausible causal relations than baselines, which took 197x to 5870x longer to execute.
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Spain (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Hepatology (0.49)
- Health & Medicine > Therapeutic Area > Nephrology (0.46)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
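LD3 itself is not reproduced here, but the conditional-independence tests it performs a linear number of are standard. A Gaussian partial-correlation (Fisher z) test, the usual workhorse for such tests on continuous data, can be sketched as follows; it is a generic building block, not the LD3 algorithm.

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(data, i, j, cond=(), alpha=0.05):
    """Gaussian conditional-independence test: is X_i independent of X_j given X_cond?

    data: (n_samples, n_variables) array. Returns True when independence is
    not rejected at level alpha.
    """
    idx = [i, j, *cond]
    # Partial correlation of X_i and X_j given X_cond, via the precision matrix.
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n, k = data.shape[0], len(cond)
    z = 0.5 * np.log((1.0 + r) / (1.0 - r)) * np.sqrt(n - k - 3)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z)))
    return p_value > alpha
```

Because LD3 only conditions locally around the sensitive attribute and outcome, the number of such calls grows linearly in the number of variables, which is what drives the runtime gap over the baselines reported in the abstract.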
Fairness Hub Technical Briefs: AUC Gap
Lee, Jinsook, Brooks, Chris, Yu, Renzhe, Kizilcec, Rene
To measure bias, we encourage teams to consider using AUC Gap: the absolute difference between the highest and lowest test AUC for subgroups (e.g., gender, race, SES, prior knowledge). It is agnostic to the AI/ML algorithm used and it captures the disparity in model performance for any number of subgroups, which enables non-binary fairness assessments such as for intersectional identity groups. The teams use a wide range of AI/ML models in pursuit of a common goal of doubling math achievement in low-income middle schools. Ensuring that the models, which are trained on datasets collected in many different contexts, do not introduce or amplify biases is important for achieving the goal. We offer here a versatile and easy-to-compute measure of model bias for all teams in order to create a common benchmark and an analytical basis for sharing what strategies have worked for different teams.
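The AUC Gap described above is straightforward to compute. A minimal sketch, assuming NumPy arrays of labels, scores, and subgroup memberships:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_gap(y_true, y_score, groups):
    """AUC Gap: absolute difference between the highest and lowest subgroup test AUC.

    Each subgroup must contain both outcome classes for its AUC to be defined.
    """
    aucs = [roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)]
    return max(aucs) - min(aucs)
```

For intersectional analyses, the `groups` array simply encodes the combined identity (e.g., gender x race), and the same gap is computed over however many subgroups result.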
Revisiting Skin Tone Fairness in Dermatological Lesion Classification
Kalb, Thorsten, Kushibar, Kaisar, Cintas, Celia, Lekadir, Karim, Diaz, Oliver, Osuala, Richard
Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we recommend further research on robust ITA estimation and diverse dataset acquisition with skin tone annotation to facilitate conclusive fairness assessments of artificial intelligence tools in dermatology.
- Europe > Switzerland (0.04)
- South America (0.04)
- Oceania > Australia (0.04)
- (6 more...)
- Research Report (0.64)
- Overview (0.48)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Skin Cancer (0.57)
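The ITA computation the abstract refers to is ITA = arctan((L* - 50) / b*) expressed in degrees, computed over skin pixels in CIELAB space; a sketch is below. The skin-mask choice, pixel aggregation, and category thresholds are exactly where the published studies compared in the paper diverge, so the aggregation and threshold values here are illustrative.

```python
import numpy as np
from skimage import color

def ita_angle(rgb_image, mask=None):
    """Individual Typology Angle (degrees) from an RGB skin image.

    ITA = arctan((L* - 50) / b*) over (optionally masked) skin pixels in CIELAB.
    Here the angle is computed from mean L* and mean b*; other studies aggregate
    per-pixel ITA values instead.
    """
    lab = color.rgb2lab(rgb_image)          # (H, W, 3) with channels L*, a*, b*
    L, b = lab[..., 0], lab[..., 2]
    if mask is not None:
        L, b = L[mask], b[mask]
    return float(np.degrees(np.arctan2(np.mean(L) - 50.0, np.mean(b))))

def ita_to_tone(ita, bounds=(55, 41, 28, 10, -30)):
    """Map an ITA value to a skin-tone category; thresholds vary across studies."""
    labels = ["very light", "light", "intermediate", "tan", "brown", "dark"]
    for label, t in zip(labels, bounds):
        if ita > t:
            return label
    return labels[-1]
```

The disagreement the paper documents arises before this mapping: different masking and aggregation choices yield different ITA values for the same lesion image, which then fall into different categories.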